Auto1 Data Science Challenge

Agenda

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Exploratory Data Analysis
  • Building a Random Forest Model
  • Evaluation

Business Understanding

A leading car trading company, which connects sellers and buyers through an online trading platform, wants to analyze its data to improve its business.

Problem Statement

The objectives are to:

  • Build a regression model to predict the price of a car given the car features and insurance-related features.
  • Build a regression model to predict the normalized-losses of a car based on the car features.

Data Understanding

In [2]:
# Importing libraries

import pandas as pd
import numpy as np

import missingno as msno

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from fancyimpute import KNN
import itertools

import graphviz
import shap

import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/graphviz-2.38/release/bin/'

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb


from sklearn.metrics import mean_absolute_error, r2_score
In [3]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated); or 199
In [65]:
# Reading the data 
car_Data = pd.read_csv('Auto1-DS-TestData.csv')
In [19]:
car_Data
Out[19]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.00 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.00 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.00 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.00 115 5500 18 22 17450
5 2 NaN audi gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.50 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.50 110 5500 19 25 17710
7 1 NaN audi gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.50 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.30 140 5500 17 20 23875
9 0 NaN audi gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.00 160 5500 16 22 NaN
10 2 192 bmw gas std two sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.80 8.80 101 5800 23 29 16430
11 0 192 bmw gas std four sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.80 8.80 101 5800 23 29 16925
12 0 188 bmw gas std two sedan rwd front 101.2 176.8 64.8 54.3 2710 ohc six 164 mpfi 3.31 3.19 9.00 121 4250 21 28 20970
13 0 188 bmw gas std four sedan rwd front 101.2 176.8 64.8 54.3 2765 ohc six 164 mpfi 3.31 3.19 9.00 121 4250 21 28 21105
14 1 NaN bmw gas std four sedan rwd front 103.5 189.0 66.9 55.7 3055 ohc six 164 mpfi 3.31 3.19 9.00 121 4250 20 25 24565
15 0 NaN bmw gas std four sedan rwd front 103.5 189.0 66.9 55.7 3230 ohc six 209 mpfi 3.62 3.39 8.00 182 5400 16 22 30760
16 0 NaN bmw gas std two sedan rwd front 103.5 193.8 67.9 53.7 3380 ohc six 209 mpfi 3.62 3.39 8.00 182 5400 16 22 41315
17 0 NaN bmw gas std four sedan rwd front 110.0 197.0 70.9 56.3 3505 ohc six 209 mpfi 3.62 3.39 8.00 182 5400 15 20 36880
18 2 121 chevrolet gas std two hatchback fwd front 88.4 141.1 60.3 53.2 1488 l three 61 2bbl 2.91 3.03 9.50 48 5100 47 53 5151
19 1 98 chevrolet gas std two hatchback fwd front 94.5 155.9 63.6 52.0 1874 ohc four 90 2bbl 3.03 3.11 9.60 70 5400 38 43 6295
20 0 81 chevrolet gas std four sedan fwd front 94.5 158.8 63.6 52.0 1909 ohc four 90 2bbl 3.03 3.11 9.60 70 5400 38 43 6575
21 1 118 dodge gas std two hatchback fwd front 93.7 157.3 63.8 50.8 1876 ohc four 90 2bbl 2.97 3.23 9.41 68 5500 37 41 5572
22 1 118 dodge gas std two hatchback fwd front 93.7 157.3 63.8 50.8 1876 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6377
23 1 118 dodge gas turbo two hatchback fwd front 93.7 157.3 63.8 50.8 2128 ohc four 98 mpfi 3.03 3.39 7.60 102 5500 24 30 7957
24 1 148 dodge gas std four hatchback fwd front 93.7 157.3 63.8 50.6 1967 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6229
25 1 148 dodge gas std four sedan fwd front 93.7 157.3 63.8 50.6 1989 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6692
26 1 148 dodge gas std four sedan fwd front 93.7 157.3 63.8 50.6 1989 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 7609
27 1 148 dodge gas turbo NaN sedan fwd front 93.7 157.3 63.8 50.6 2191 ohc four 98 mpfi 3.03 3.39 7.60 102 5500 24 30 8558
28 -1 110 dodge gas std four wagon fwd front 103.3 174.6 64.6 59.8 2535 ohc four 122 2bbl 3.34 3.46 8.50 88 5000 24 30 8921
29 3 145 dodge gas turbo two hatchback fwd front 95.9 173.2 66.3 50.2 2811 ohc four 156 mfi 3.60 3.90 7.00 145 5000 19 24 12964
30 2 137 honda gas std two hatchback fwd front 86.6 144.6 63.9 50.8 1713 ohc four 92 1bbl 2.91 3.41 9.60 58 4800 49 54 6479
31 2 137 honda gas std two hatchback fwd front 86.6 144.6 63.9 50.8 1819 ohc four 92 1bbl 2.91 3.41 9.20 76 6000 31 38 6855
32 1 101 honda gas std two hatchback fwd front 93.7 150.0 64.0 52.6 1837 ohc four 79 1bbl 2.91 3.07 10.10 60 5500 38 42 5399
33 1 101 honda gas std two hatchback fwd front 93.7 150.0 64.0 52.6 1940 ohc four 92 1bbl 2.91 3.41 9.20 76 6000 30 34 6529
34 1 101 honda gas std two hatchback fwd front 93.7 150.0 64.0 52.6 1956 ohc four 92 1bbl 2.91 3.41 9.20 76 6000 30 34 7129
35 0 110 honda gas std four sedan fwd front 96.5 163.4 64.0 54.5 2010 ohc four 92 1bbl 2.91 3.41 9.20 76 6000 30 34 7295
36 0 78 honda gas std four wagon fwd front 96.5 157.1 63.9 58.3 2024 ohc four 92 1bbl 2.92 3.41 9.20 76 6000 30 34 7295
37 0 106 honda gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2236 ohc four 110 1bbl 3.15 3.58 9.00 86 5800 27 33 7895
38 0 106 honda gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2289 ohc four 110 1bbl 3.15 3.58 9.00 86 5800 27 33 9095
39 0 85 honda gas std four sedan fwd front 96.5 175.4 65.2 54.1 2304 ohc four 110 1bbl 3.15 3.58 9.00 86 5800 27 33 8845
40 0 85 honda gas std four sedan fwd front 96.5 175.4 62.5 54.1 2372 ohc four 110 1bbl 3.15 3.58 9.00 86 5800 27 33 10295
41 0 85 honda gas std four sedan fwd front 96.5 175.4 65.2 54.1 2465 ohc four 110 mpfi 3.15 3.58 9.00 101 5800 24 28 12945
42 1 107 honda gas std two sedan fwd front 96.5 169.1 66.0 51.0 2293 ohc four 110 2bbl 3.15 3.58 9.10 100 5500 25 31 10345
43 0 NaN isuzu gas std four sedan rwd front 94.3 170.7 61.8 53.5 2337 ohc four 111 2bbl 3.31 3.23 8.50 78 4800 24 29 6785
44 1 NaN isuzu gas std two sedan fwd front 94.5 155.9 63.6 52.0 1874 ohc four 90 2bbl 3.03 3.11 9.60 70 5400 38 43 NaN
45 0 NaN isuzu gas std four sedan fwd front 94.5 155.9 63.6 52.0 1909 ohc four 90 2bbl 3.03 3.11 9.60 70 5400 38 43 NaN
46 2 NaN isuzu gas std two hatchback rwd front 96.0 172.6 65.2 51.4 2734 ohc four 119 spfi 3.43 3.23 9.20 90 5000 24 29 11048
47 0 145 jaguar gas std four sedan rwd front 113.0 199.6 69.6 52.8 4066 dohc six 258 mpfi 3.63 4.17 8.10 176 4750 15 19 32250
48 0 NaN jaguar gas std four sedan rwd front 113.0 199.6 69.6 52.8 4066 dohc six 258 mpfi 3.63 4.17 8.10 176 4750 15 19 35550
49 0 NaN jaguar gas std two sedan rwd front 102.0 191.7 70.6 47.8 3950 ohcv twelve 326 mpfi 3.54 2.76 11.50 262 5000 13 17 36000
50 1 104 mazda gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1890 ohc four 91 2bbl 3.03 3.15 9.00 68 5000 30 31 5195
51 1 104 mazda gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1900 ohc four 91 2bbl 3.03 3.15 9.00 68 5000 31 38 6095
52 1 104 mazda gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1905 ohc four 91 2bbl 3.03 3.15 9.00 68 5000 31 38 6795
53 1 113 mazda gas std four sedan fwd front 93.1 166.8 64.2 54.1 1945 ohc four 91 2bbl 3.03 3.15 9.00 68 5000 31 38 6695
54 1 113 mazda gas std four sedan fwd front 93.1 166.8 64.2 54.1 1950 ohc four 91 2bbl 3.08 3.15 9.00 68 5000 31 38 7395
55 3 150 mazda gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl NaN NaN 9.40 101 6000 17 23 10945
56 3 150 mazda gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl NaN NaN 9.40 101 6000 17 23 11845
57 3 150 mazda gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2385 rotor two 70 4bbl NaN NaN 9.40 101 6000 17 23 13645
58 3 150 mazda gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2500 rotor two 80 mpfi NaN NaN 9.40 135 6000 16 23 15645
59 1 129 mazda gas std two hatchback fwd front 98.8 177.8 66.5 53.7 2385 ohc four 122 2bbl 3.39 3.39 8.60 84 4800 26 32 8845
60 0 115 mazda gas std four sedan fwd front 98.8 177.8 66.5 55.5 2410 ohc four 122 2bbl 3.39 3.39 8.60 84 4800 26 32 8495
61 1 129 mazda gas std two hatchback fwd front 98.8 177.8 66.5 53.7 2385 ohc four 122 2bbl 3.39 3.39 8.60 84 4800 26 32 10595
62 0 115 mazda gas std four sedan fwd front 98.8 177.8 66.5 55.5 2410 ohc four 122 2bbl 3.39 3.39 8.60 84 4800 26 32 10245
63 0 NaN mazda diesel std NaN sedan fwd front 98.8 177.8 66.5 55.5 2443 ohc four 122 idi 3.39 3.39 22.70 64 4650 36 42 10795
64 0 115 mazda gas std four hatchback fwd front 98.8 177.8 66.5 55.5 2425 ohc four 122 2bbl 3.39 3.39 8.60 84 4800 26 32 11245
65 0 118 mazda gas std four sedan rwd front 104.9 175.0 66.1 54.4 2670 ohc four 140 mpfi 3.76 3.16 8.00 120 5000 19 27 18280
66 0 NaN mazda diesel std four sedan rwd front 104.9 175.0 66.1 54.4 2700 ohc four 134 idi 3.43 3.64 22.00 72 4200 31 39 18344
67 -1 93 mercedes-benz diesel turbo four sedan rwd front 110.0 190.9 70.3 56.5 3515 ohc five 183 idi 3.58 3.64 21.50 123 4350 22 25 25552
68 -1 93 mercedes-benz diesel turbo four wagon rwd front 110.0 190.9 70.3 58.7 3750 ohc five 183 idi 3.58 3.64 21.50 123 4350 22 25 28248
69 0 93 mercedes-benz diesel turbo two hardtop rwd front 106.7 187.5 70.3 54.9 3495 ohc five 183 idi 3.58 3.64 21.50 123 4350 22 25 28176
70 -1 93 mercedes-benz diesel turbo four sedan rwd front 115.6 202.6 71.7 56.3 3770 ohc five 183 idi 3.58 3.64 21.50 123 4350 22 25 31600
71 -1 NaN mercedes-benz gas std four sedan rwd front 115.6 202.6 71.7 56.5 3740 ohcv eight 234 mpfi 3.46 3.10 8.30 155 4750 16 18 34184
72 3 142 mercedes-benz gas std two convertible rwd front 96.6 180.3 70.5 50.8 3685 ohcv eight 234 mpfi 3.46 3.10 8.30 155 4750 16 18 35056
73 0 NaN mercedes-benz gas std four sedan rwd front 120.9 208.1 71.7 56.7 3900 ohcv eight 308 mpfi 3.80 3.35 8.00 184 4500 14 16 40960
74 1 NaN mercedes-benz gas std two hardtop rwd front 112.0 199.2 72.0 55.4 3715 ohcv eight 304 mpfi 3.80 3.35 8.00 184 4500 14 16 45400
75 1 NaN mercury gas turbo two hatchback rwd front 102.7 178.4 68.0 54.8 2910 ohc four 140 mpfi 3.78 3.12 8.00 175 5000 19 24 16503
76 2 161 mitsubishi gas std two hatchback fwd front 93.7 157.3 64.4 50.8 1918 ohc four 92 2bbl 2.97 3.23 9.40 68 5500 37 41 5389
77 2 161 mitsubishi gas std two hatchback fwd front 93.7 157.3 64.4 50.8 1944 ohc four 92 2bbl 2.97 3.23 9.40 68 5500 31 38 6189
78 2 161 mitsubishi gas std two hatchback fwd front 93.7 157.3 64.4 50.8 2004 ohc four 92 2bbl 2.97 3.23 9.40 68 5500 31 38 6669
79 1 161 mitsubishi gas turbo two hatchback fwd front 93.0 157.3 63.8 50.8 2145 ohc four 98 spdi 3.03 3.39 7.60 102 5500 24 30 7689
80 3 153 mitsubishi gas turbo two hatchback fwd front 96.3 173.0 65.4 49.4 2370 ohc four 110 spdi 3.17 3.46 7.50 116 5500 23 30 9959
81 3 153 mitsubishi gas std two hatchback fwd front 96.3 173.0 65.4 49.4 2328 ohc four 122 2bbl 3.35 3.46 8.50 88 5000 25 32 8499
82 3 NaN mitsubishi gas turbo two hatchback fwd front 95.9 173.2 66.3 50.2 2833 ohc four 156 spdi 3.58 3.86 7.00 145 5000 19 24 12629
83 3 NaN mitsubishi gas turbo two hatchback fwd front 95.9 173.2 66.3 50.2 2921 ohc four 156 spdi 3.59 3.86 7.00 145 5000 19 24 14869
84 3 NaN mitsubishi gas turbo two hatchback fwd front 95.9 173.2 66.3 50.2 2926 ohc four 156 spdi 3.59 3.86 7.00 145 5000 19 24 14489
85 1 125 mitsubishi gas std four sedan fwd front 96.3 172.4 65.4 51.6 2365 ohc four 122 2bbl 3.35 3.46 8.50 88 5000 25 32 6989
86 1 125 mitsubishi gas std four sedan fwd front 96.3 172.4 65.4 51.6 2405 ohc four 122 2bbl 3.35 3.46 8.50 88 5000 25 32 8189
87 1 125 mitsubishi gas turbo four sedan fwd front 96.3 172.4 65.4 51.6 2403 ohc four 110 spdi 3.17 3.46 7.50 116 5500 23 30 9279
88 -1 137 mitsubishi gas std four sedan fwd front 96.3 172.4 65.4 51.6 2403 ohc four 110 spdi 3.17 3.46 7.50 116 5500 23 30 9279
89 1 128 nissan gas std two sedan fwd front 94.5 165.3 63.8 54.5 1889 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 5499
90 1 128 nissan diesel std two sedan fwd front 94.5 165.3 63.8 54.5 2017 ohc four 103 idi 2.99 3.47 21.90 55 4800 45 50 7099
91 1 128 nissan gas std two sedan fwd front 94.5 165.3 63.8 54.5 1918 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 6649
92 1 122 nissan gas std four sedan fwd front 94.5 165.3 63.8 54.5 1938 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 6849
93 1 103 nissan gas std four wagon fwd front 94.5 170.2 63.8 53.5 2024 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 7349
94 1 128 nissan gas std two sedan fwd front 94.5 165.3 63.8 54.5 1951 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 7299
95 1 128 nissan gas std two hatchback fwd front 94.5 165.6 63.8 53.3 2028 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 7799
96 1 122 nissan gas std four sedan fwd front 94.5 165.3 63.8 54.5 1971 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 7499
97 1 103 nissan gas std four wagon fwd front 94.5 170.2 63.8 53.5 2037 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 7999
98 2 168 nissan gas std two hardtop fwd front 95.1 162.4 63.8 53.3 2008 ohc four 97 2bbl 3.15 3.29 9.40 69 5200 31 37 8249
99 0 106 nissan gas std four hatchback fwd front 97.2 173.4 65.2 54.7 2324 ohc four 120 2bbl 3.33 3.47 8.50 97 5200 27 34 8949
100 0 106 nissan gas std four sedan fwd front 97.2 173.4 65.2 54.7 2302 ohc four 120 2bbl 3.33 3.47 8.50 97 5200 27 34 9549
101 0 128 nissan gas std four sedan fwd front 100.4 181.7 66.5 55.1 3095 ohcv six 181 mpfi 3.43 3.27 9.00 152 5200 17 22 13499
102 0 108 nissan gas std four wagon fwd front 100.4 184.6 66.5 56.1 3296 ohcv six 181 mpfi 3.43 3.27 9.00 152 5200 17 22 14399
103 0 108 nissan gas std four sedan fwd front 100.4 184.6 66.5 55.1 3060 ohcv six 181 mpfi 3.43 3.27 9.00 152 5200 19 25 13499
104 3 194 nissan gas std two hatchback rwd front 91.3 170.7 67.9 49.7 3071 ohcv six 181 mpfi 3.43 3.27 9.00 160 5200 19 25 17199
105 3 194 nissan gas turbo two hatchback rwd front 91.3 170.7 67.9 49.7 3139 ohcv six 181 mpfi 3.43 3.27 7.80 200 5200 17 23 19699
106 1 231 nissan gas std two hatchback rwd front 99.2 178.5 67.9 49.7 3139 ohcv six 181 mpfi 3.43 3.27 9.00 160 5200 19 25 18399
107 0 161 peugot gas std four sedan rwd front 107.9 186.7 68.4 56.7 3020 l four 120 mpfi 3.46 3.19 8.40 97 5000 19 24 11900
108 0 161 peugot diesel turbo four sedan rwd front 107.9 186.7 68.4 56.7 3197 l four 152 idi 3.70 3.52 21.00 95 4150 28 33 13200
109 0 NaN peugot gas std four wagon rwd front 114.2 198.9 68.4 58.7 3230 l four 120 mpfi 3.46 3.19 8.40 97 5000 19 24 12440
110 0 NaN peugot diesel turbo four wagon rwd front 114.2 198.9 68.4 58.7 3430 l four 152 idi 3.70 3.52 21.00 95 4150 25 25 13860
111 0 161 peugot gas std four sedan rwd front 107.9 186.7 68.4 56.7 3075 l four 120 mpfi 3.46 2.19 8.40 95 5000 19 24 15580
112 0 161 peugot diesel turbo four sedan rwd front 107.9 186.7 68.4 56.7 3252 l four 152 idi 3.70 3.52 21.00 95 4150 28 33 16900
113 0 NaN peugot gas std four wagon rwd front 114.2 198.9 68.4 56.7 3285 l four 120 mpfi 3.46 2.19 8.40 95 5000 19 24 16695
114 0 NaN peugot diesel turbo four wagon rwd front 114.2 198.9 68.4 58.7 3485 l four 152 idi 3.70 3.52 21.00 95 4150 25 25 17075
115 0 161 peugot gas std four sedan rwd front 107.9 186.7 68.4 56.7 3075 l four 120 mpfi 3.46 3.19 8.40 97 5000 19 24 16630
116 0 161 peugot diesel turbo four sedan rwd front 107.9 186.7 68.4 56.7 3252 l four 152 idi 3.70 3.52 21.00 95 4150 28 33 17950
117 0 161 peugot gas turbo four sedan rwd front 108.0 186.7 68.3 56.0 3130 l four 134 mpfi 3.61 3.21 7.00 142 5600 18 24 18150
118 1 119 plymouth gas std two hatchback fwd front 93.7 157.3 63.8 50.8 1918 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 37 41 5572
119 1 119 plymouth gas turbo two hatchback fwd front 93.7 157.3 63.8 50.8 2128 ohc four 98 spdi 3.03 3.39 7.60 102 5500 24 30 7957
120 1 154 plymouth gas std four hatchback fwd front 93.7 157.3 63.8 50.6 1967 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6229
121 1 154 plymouth gas std four sedan fwd front 93.7 167.3 63.8 50.8 1989 ohc four 90 2bbl 2.97 3.23 9.40 68 5500 31 38 6692
122 1 154 plymouth gas std four sedan fwd front 93.7 167.3 63.8 50.8 2191 ohc four 98 2bbl 2.97 3.23 9.40 68 5500 31 38 7609
123 -1 74 plymouth gas std four wagon fwd front 103.3 174.6 64.6 59.8 2535 ohc four 122 2bbl 3.35 3.46 8.50 88 5000 24 30 8921
124 3 NaN plymouth gas turbo two hatchback rwd front 95.9 173.2 66.3 50.2 2818 ohc four 156 spdi 3.59 3.86 7.00 145 5000 19 24 12764
125 3 186 porsche gas std two hatchback rwd front 94.5 168.9 68.3 50.2 2778 ohc four 151 mpfi 3.94 3.11 9.50 143 5500 19 27 22018
126 3 NaN porsche gas std two hardtop rwd rear 89.5 168.9 65.0 51.6 2756 ohcf six 194 mpfi 3.74 2.90 9.50 207 5900 17 25 32528
127 3 NaN porsche gas std two hardtop rwd rear 89.5 168.9 65.0 51.6 2756 ohcf six 194 mpfi 3.74 2.90 9.50 207 5900 17 25 34028
128 3 NaN porsche gas std two convertible rwd rear 89.5 168.9 65.0 51.6 2800 ohcf six 194 mpfi 3.74 2.90 9.50 207 5900 17 25 37028
129 1 NaN porsche gas std two hatchback rwd front 98.4 175.7 72.3 50.5 3366 dohcv eight 203 mpfi 3.94 3.11 10.00 288 5750 17 28 NaN
130 0 NaN renault gas std four wagon fwd front 96.1 181.5 66.5 55.2 2579 ohc four 132 mpfi 3.46 3.90 8.70 NaN NaN 23 31 9295
131 2 NaN renault gas std two hatchback fwd front 96.1 176.8 66.6 50.5 2460 ohc four 132 mpfi 3.46 3.90 8.70 NaN NaN 23 31 9895
132 3 150 saab gas std two hatchback fwd front 99.1 186.6 66.5 56.1 2658 ohc four 121 mpfi 3.54 3.07 9.31 110 5250 21 28 11850
133 2 104 saab gas std four sedan fwd front 99.1 186.6 66.5 56.1 2695 ohc four 121 mpfi 3.54 3.07 9.30 110 5250 21 28 12170
134 3 150 saab gas std two hatchback fwd front 99.1 186.6 66.5 56.1 2707 ohc four 121 mpfi 2.54 2.07 9.30 110 5250 21 28 15040
135 2 104 saab gas std four sedan fwd front 99.1 186.6 66.5 56.1 2758 ohc four 121 mpfi 3.54 3.07 9.30 110 5250 21 28 15510
136 3 150 saab gas turbo two hatchback fwd front 99.1 186.6 66.5 56.1 2808 dohc four 121 mpfi 3.54 3.07 9.00 160 5500 19 26 18150
137 2 104 saab gas turbo four sedan fwd front 99.1 186.6 66.5 56.1 2847 dohc four 121 mpfi 3.54 3.07 9.00 160 5500 19 26 18620
138 2 83 subaru gas std two hatchback fwd front 93.7 156.9 63.4 53.7 2050 ohcf four 97 2bbl 3.62 2.36 9.00 69 4900 31 36 5118
139 2 83 subaru gas std two hatchback fwd front 93.7 157.9 63.6 53.7 2120 ohcf four 108 2bbl 3.62 2.64 8.70 73 4400 26 31 7053
140 2 83 subaru gas std two hatchback 4wd front 93.3 157.3 63.8 55.7 2240 ohcf four 108 2bbl 3.62 2.64 8.70 73 4400 26 31 7603
141 0 102 subaru gas std four sedan fwd front 97.2 172.0 65.4 52.5 2145 ohcf four 108 2bbl 3.62 2.64 9.50 82 4800 32 37 7126
142 0 102 subaru gas std four sedan fwd front 97.2 172.0 65.4 52.5 2190 ohcf four 108 2bbl 3.62 2.64 9.50 82 4400 28 33 7775
143 0 102 subaru gas std four sedan fwd front 97.2 172.0 65.4 52.5 2340 ohcf four 108 mpfi 3.62 2.64 9.00 94 5200 26 32 9960
144 0 102 subaru gas std four sedan 4wd front 97.0 172.0 65.4 54.3 2385 ohcf four 108 2bbl 3.62 2.64 9.00 82 4800 24 25 9233
145 0 102 subaru gas turbo four sedan 4wd front 97.0 172.0 65.4 54.3 2510 ohcf four 108 mpfi 3.62 2.64 7.70 111 4800 24 29 11259
146 0 89 subaru gas std four wagon fwd front 97.0 173.5 65.4 53.0 2290 ohcf four 108 2bbl 3.62 2.64 9.00 82 4800 28 32 7463
147 0 89 subaru gas std four wagon fwd front 97.0 173.5 65.4 53.0 2455 ohcf four 108 mpfi 3.62 2.64 9.00 94 5200 25 31 10198
148 0 85 subaru gas std four wagon 4wd front 96.9 173.6 65.4 54.9 2420 ohcf four 108 2bbl 3.62 2.64 9.00 82 4800 23 29 8013
149 0 85 subaru gas turbo four wagon 4wd front 96.9 173.6 65.4 54.9 2650 ohcf four 108 mpfi 3.62 2.64 7.70 111 4800 23 23 11694
150 1 87 toyota gas std two hatchback fwd front 95.7 158.7 63.6 54.5 1985 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 35 39 5348
151 1 87 toyota gas std two hatchback fwd front 95.7 158.7 63.6 54.5 2040 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 31 38 6338
152 1 74 toyota gas std four hatchback fwd front 95.7 158.7 63.6 54.5 2015 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 31 38 6488
153 0 77 toyota gas std four wagon fwd front 95.7 169.7 63.6 59.1 2280 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 31 37 6918
154 0 81 toyota gas std four wagon 4wd front 95.7 169.7 63.6 59.1 2290 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 27 32 7898
155 0 91 toyota gas std four wagon 4wd front 95.7 169.7 63.6 59.1 3110 ohc four 92 2bbl 3.05 3.03 9.00 62 4800 27 32 8778
156 0 91 toyota gas std four sedan fwd front 95.7 166.3 64.4 53.0 2081 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 30 37 6938
157 0 91 toyota gas std four hatchback fwd front 95.7 166.3 64.4 52.8 2109 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 30 37 7198
158 0 91 toyota diesel std four sedan fwd front 95.7 166.3 64.4 53.0 2275 ohc four 110 idi 3.27 3.35 22.50 56 4500 34 36 7898
159 0 91 toyota diesel std four hatchback fwd front 95.7 166.3 64.4 52.8 2275 ohc four 110 idi 3.27 3.35 22.50 56 4500 38 47 7788
160 0 91 toyota gas std four sedan fwd front 95.7 166.3 64.4 53.0 2094 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 38 47 7738
161 0 91 toyota gas std four hatchback fwd front 95.7 166.3 64.4 52.8 2122 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 28 34 8358
162 0 91 toyota gas std four sedan fwd front 95.7 166.3 64.4 52.8 2140 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 28 34 9258
163 1 168 toyota gas std two sedan rwd front 94.5 168.7 64.0 52.6 2169 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 29 34 8058
164 1 168 toyota gas std two hatchback rwd front 94.5 168.7 64.0 52.6 2204 ohc four 98 2bbl 3.19 3.03 9.00 70 4800 29 34 8238
165 1 168 toyota gas std two sedan rwd front 94.5 168.7 64.0 52.6 2265 dohc four 98 mpfi 3.24 3.08 9.40 112 6600 26 29 9298
166 1 168 toyota gas std two hatchback rwd front 94.5 168.7 64.0 52.6 2300 dohc four 98 mpfi 3.24 3.08 9.40 112 6600 26 29 9538
167 2 134 toyota gas std two hardtop rwd front 98.4 176.2 65.6 52.0 2540 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 8449
168 2 134 toyota gas std two hardtop rwd front 98.4 176.2 65.6 52.0 2536 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 9639
169 2 134 toyota gas std two hatchback rwd front 98.4 176.2 65.6 52.0 2551 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 9989
170 2 134 toyota gas std two hardtop rwd front 98.4 176.2 65.6 52.0 2679 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 11199
171 2 134 toyota gas std two hatchback rwd front 98.4 176.2 65.6 52.0 2714 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 11549
172 2 134 toyota gas std two convertible rwd front 98.4 176.2 65.6 53.0 2975 ohc four 146 mpfi 3.62 3.50 9.30 116 4800 24 30 17669
173 -1 65 toyota gas std four sedan fwd front 102.4 175.6 66.5 54.9 2326 ohc four 122 mpfi 3.31 3.54 8.70 92 4200 29 34 8948
174 -1 65 toyota diesel turbo four sedan fwd front 102.4 175.6 66.5 54.9 2480 ohc four 110 idi 3.27 3.35 22.50 73 4500 30 33 10698
175 -1 65 toyota gas std four hatchback fwd front 102.4 175.6 66.5 53.9 2414 ohc four 122 mpfi 3.31 3.54 8.70 92 4200 27 32 9988
176 -1 65 toyota gas std four sedan fwd front 102.4 175.6 66.5 54.9 2414 ohc four 122 mpfi 3.31 3.54 8.70 92 4200 27 32 10898
177 -1 65 toyota gas std four hatchback fwd front 102.4 175.6 66.5 53.9 2458 ohc four 122 mpfi 3.31 3.54 8.70 92 4200 27 32 11248
178 3 197 toyota gas std two hatchback rwd front 102.9 183.5 67.7 52.0 2976 dohc six 171 mpfi 3.27 3.35 9.30 161 5200 20 24 16558
179 3 197 toyota gas std two hatchback rwd front 102.9 183.5 67.7 52.0 3016 dohc six 171 mpfi 3.27 3.35 9.30 161 5200 19 24 15998
180 -1 90 toyota gas std four sedan rwd front 104.5 187.8 66.5 54.1 3131 dohc six 171 mpfi 3.27 3.35 9.20 156 5200 20 24 15690
181 -1 NaN toyota gas std four wagon rwd front 104.5 187.8 66.5 54.1 3151 dohc six 161 mpfi 3.27 3.35 9.20 156 5200 19 24 15750
182 2 122 volkswagen diesel std two sedan fwd front 97.3 171.7 65.5 55.7 2261 ohc four 97 idi 3.01 3.40 23.00 52 4800 37 46 7775
183 2 122 volkswagen gas std two sedan fwd front 97.3 171.7 65.5 55.7 2209 ohc four 109 mpfi 3.19 3.40 9.00 85 5250 27 34 7975
184 2 94 volkswagen diesel std four sedan fwd front 97.3 171.7 65.5 55.7 2264 ohc four 97 idi 3.01 3.40 23.00 52 4800 37 46 7995
185 2 94 volkswagen gas std four sedan fwd front 97.3 171.7 65.5 55.7 2212 ohc four 109 mpfi 3.19 3.40 9.00 85 5250 27 34 8195
186 2 94 volkswagen gas std four sedan fwd front 97.3 171.7 65.5 55.7 2275 ohc four 109 mpfi 3.19 3.40 9.00 85 5250 27 34 8495
187 2 94 volkswagen diesel turbo four sedan fwd front 97.3 171.7 65.5 55.7 2319 ohc four 97 idi 3.01 3.40 23.00 68 4500 37 42 9495
188 2 94 volkswagen gas std four sedan fwd front 97.3 171.7 65.5 55.7 2300 ohc four 109 mpfi 3.19 3.40 10.00 100 5500 26 32 9995
189 3 NaN volkswagen gas std two convertible fwd front 94.5 159.3 64.2 55.6 2254 ohc four 109 mpfi 3.19 3.40 8.50 90 5500 24 29 11595
190 3 256 volkswagen gas std two hatchback fwd front 94.5 165.7 64.0 51.4 2221 ohc four 109 mpfi 3.19 3.40 8.50 90 5500 24 29 9980
191 0 NaN volkswagen gas std four sedan fwd front 100.4 180.2 66.9 55.1 2661 ohc five 136 mpfi 3.19 3.40 8.50 110 5500 19 24 13295
192 0 NaN volkswagen diesel turbo four sedan fwd front 100.4 180.2 66.9 55.1 2579 ohc four 97 idi 3.01 3.40 23.00 68 4500 33 38 13845
193 0 NaN volkswagen gas std four wagon fwd front 100.4 183.1 66.9 55.1 2563 ohc four 109 mpfi 3.19 3.40 9.00 88 5500 25 31 12290
194 -2 103 volvo gas std four sedan rwd front 104.3 188.8 67.2 56.2 2912 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 23 28 12940
195 -1 74 volvo gas std four wagon rwd front 104.3 188.8 67.2 57.5 3034 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 23 28 13415
196 -2 103 volvo gas std four sedan rwd front 104.3 188.8 67.2 56.2 2935 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 24 28 15985
197 -1 74 volvo gas std four wagon rwd front 104.3 188.8 67.2 57.5 3042 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 24 28 16515
198 -2 103 volvo gas turbo four sedan rwd front 104.3 188.8 67.2 56.2 3045 ohc four 130 mpfi 3.62 3.15 7.50 162 5100 17 22 18420
199 -1 74 volvo gas turbo four wagon rwd front 104.3 188.8 67.2 57.5 3157 ohc four 130 mpfi 3.62 3.15 7.50 162 5100 17 22 18950
200 -1 95 volvo gas std four sedan rwd front 109.1 188.8 68.9 55.5 2952 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 23 28 16845
201 -1 95 volvo gas turbo four sedan rwd front 109.1 188.8 68.8 55.5 3049 ohc four 141 mpfi 3.78 3.15 8.70 160 5300 19 25 19045
202 -1 95 volvo gas std four sedan rwd front 109.1 188.8 68.9 55.5 3012 ohcv six 173 mpfi 3.58 2.87 8.80 134 5500 18 23 21485
203 -1 95 volvo diesel turbo four sedan rwd front 109.1 188.8 68.9 55.5 3217 ohc six 145 idi 3.01 3.40 23.00 106 4800 26 27 22470
204 -1 95 volvo gas turbo four sedan rwd front 109.1 188.8 68.9 55.5 3062 ohc four 141 mpfi 3.78 3.15 9.50 114 5400 19 25 22625
In [5]:
car_Data.dtypes
Out[5]:
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object
In [14]:
car_Data.describe(include = 'all')
Out[14]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 205.000000 205 205 205 205 205 205 205 205 205.000000 205.000000 205.000000 205.000000 205.000000 205 205 205.000000 205 205 205 205.000000 205 205 205.000000 205.000000 205
unique NaN 52 22 2 2 3 5 3 2 NaN NaN NaN NaN NaN 7 7 NaN 8 39 37 NaN 60 24 NaN NaN 187
top NaN ? toyota gas std four sedan fwd front NaN NaN NaN NaN NaN ohc four NaN mpfi 3.62 3.40 NaN 68 5500 NaN NaN ?
freq NaN 41 32 185 168 114 96 120 202 NaN NaN NaN NaN NaN 148 159 NaN 94 23 20 NaN 19 37 NaN NaN 4
mean 0.834146 NaN NaN NaN NaN NaN NaN NaN NaN 98.756585 174.049268 65.907805 53.724878 2555.565854 NaN NaN 126.907317 NaN NaN NaN 10.142537 NaN NaN 25.219512 30.751220 NaN
std 1.245307 NaN NaN NaN NaN NaN NaN NaN NaN 6.021776 12.337289 2.145204 2.443522 520.680204 NaN NaN 41.642693 NaN NaN NaN 3.972040 NaN NaN 6.542142 6.886443 NaN
min -2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 86.600000 141.100000 60.300000 47.800000 1488.000000 NaN NaN 61.000000 NaN NaN NaN 7.000000 NaN NaN 13.000000 16.000000 NaN
25% 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 94.500000 166.300000 64.100000 52.000000 2145.000000 NaN NaN 97.000000 NaN NaN NaN 8.600000 NaN NaN 19.000000 25.000000 NaN
50% 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 97.000000 173.200000 65.500000 54.100000 2414.000000 NaN NaN 120.000000 NaN NaN NaN 9.000000 NaN NaN 24.000000 30.000000 NaN
75% 2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 102.400000 183.100000 66.900000 55.500000 2935.000000 NaN NaN 141.000000 NaN NaN NaN 9.400000 NaN NaN 30.000000 34.000000 NaN
max 3.000000 NaN NaN NaN NaN NaN NaN NaN NaN 120.900000 208.100000 72.300000 59.800000 4066.000000 NaN NaN 326.000000 NaN NaN NaN 23.000000 NaN NaN 49.000000 54.000000 NaN

Attribute Information:

  1. symboling: -3, -2, -1, 0, 1, 2, 3.
  2. normalized-losses: continuous from 65 to 256.
  3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
  4. fuel-type: diesel, gas.
  5. aspiration: std, turbo.
  6. num-of-doors: four, two.
  7. body-style: hardtop, wagon, sedan, hatchback, convertible.
  8. drive-wheels: 4wd, fwd, rwd.
  9. engine-location: front, rear.
  10. wheel-base: continuous from 86.6 to 120.9.
  11. length: continuous from 141.1 to 208.1.
  12. width: continuous from 60.3 to 72.3.
  13. height: continuous from 47.8 to 59.8.
  14. curb-weight: continuous from 1488 to 4066.
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
  16. num-of-cylinders: eight, five, four, six, three, twelve, two.
  17. engine-size: continuous from 61 to 326.
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
  19. bore: continuous from 2.54 to 3.94.
  20. stroke: continuous from 2.07 to 4.17.
  21. compression-ratio: continuous from 7 to 23.
  22. horsepower: continuous from 48 to 288.
  23. peak-rpm: continuous from 4150 to 6600.
  24. city-mpg: continuous from 13 to 49.
  25. highway-mpg: continuous from 16 to 54.
  26. price: continuous from 5118 to 45400.

Data Preparation

Preprocessing to be done :

  1. Cleansing data
  2. Converting attributes to appropriate data type

Cleansing Data :

  • Replacing '?' with nulls
  • num-of-doors has an extra unique level ('?') that needs to be replaced
In [66]:
# Replacing '?' with nulls
car_Data = car_Data.replace('?', np.nan)
In [6]:
# Checking for nulls
car_Data.isnull().sum()
Out[6]:
symboling            0 
normalized-losses    41
make                 0 
fuel-type            0 
aspiration           0 
num-of-doors         2 
body-style           0 
drive-wheels         0 
engine-location      0 
wheel-base           0 
length               0 
width                0 
height               0 
curb-weight          0 
engine-type          0 
num-of-cylinders     0 
engine-size          0 
fuel-system          0 
bore                 4 
stroke               4 
compression-ratio    0 
horsepower           2 
peak-rpm             2 
city-mpg             0 
highway-mpg          0 
price                4 
dtype: int64
In [122]:
# Drop the rows where the target (price) is missing
car_Data = car_Data.dropna(subset=['price'])

The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

In [25]:
msno.matrix(car_Data)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fa7940>

From the above plot, we can see that the normalized-losses column has the most missing values.

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

In [26]:
msno.heatmap(car_Data)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x12085358>

There seems to be a high correlation of missing values between stroke and bore, and between horsepower and peak-rpm.
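
The quantity behind this kind of heatmap can be approximated in plain pandas by correlating missingness indicators. The following is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative and not taken from the dataset); it is not meant to reproduce missingno's exact output.

```python
import numpy as np
import pandas as pd

# Toy frame: 'bore' and 'stroke' are missing in the same rows,
# 'horsepower' is missing in a different row. Values are illustrative only.
toy_df = pd.DataFrame({
    'bore':       [3.47, np.nan, 3.19, np.nan, 3.62, 3.31],
    'stroke':     [2.68, np.nan, 3.40, np.nan, 3.50, 3.19],
    'horsepower': [111.0, 154.0, np.nan, 102.0, 115.0, 110.0],
})

# 1 where a value is missing, 0 where it is present.
null_indicator = toy_df.isnull().astype(int)

# Pearson correlation between the missingness indicators.
nullity_corr = null_indicator.corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, while a negative value means that when one is missing the other tends to be present.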

The following columns need to be converted to appropriate data types:

  • symboling to object/category
  • normalized-losses to int/float
  • bore, stroke, horsepower, peak-rpm and price to continuous
In [123]:
car_Data['symboling'] = car_Data['symboling'].astype('object')
for column in ['normalized-losses','bore','stroke','peak-rpm','price','horsepower']:
    car_Data[column] = car_Data[column].astype('float')
In [10]:
car_Data.dtypes
Out[10]:
symboling            object 
normalized-losses    float64
make                 object 
fuel-type            object 
aspiration           object 
num-of-doors         object 
body-style           object 
drive-wheels         object 
engine-location      object 
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight          int64  
engine-type          object 
num-of-cylinders     object 
engine-size          int64  
fuel-system          object 
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg             int64  
highway-mpg          int64  
price                float64
dtype: object
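
The `astype('float')` conversion works here because the '?' placeholders were already replaced with nulls. If unparseable strings could still be present, a hedged alternative is `pd.to_numeric` with `errors='coerce'`, sketched below on a toy series (the values are illustrative, not from the dataset).

```python
import pandas as pd

# Toy column mimicking a numeric feature read in as strings; values are illustrative.
raw = pd.Series(['111', '154', '?', '102'], name='horsepower')

# Unparseable entries such as '?' become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
```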

Exploratory Data Analysis

Univariate Analysis

In [29]:
for column in car_Data.columns:
    if car_Data[column].dtype in ('int64', 'float64'):
        # Numeric feature: histogram with a rug plot of the individual values
        sns.distplot(car_Data[column].dropna(), kde=False, rug=True)
        plt.show()
    else:
        # Categorical feature: bar chart of level counts
        sns.countplot(x=car_Data[column], palette="deep")
        plt.xticks(rotation=90)
        plt.show()

Analysis from the above plots :

  1. Toyota is the most common make, with over 40% more vehicles than the second most common make, Nissan.
  2. Standard aspiration is far more common than turbo, accounting for more than 80% of the cars.
  3. For drive wheels, front-wheel drive has the most cars, followed by rear-wheel drive and four-wheel drive. Very few cars have four-wheel drive.
  4. Curb weight is distributed between approximately 1500 and 4000.
  5. Symboling, the insurance risk rating, can range from -3 to 3, but in this dataset it starts at -2. Most cars have a rating of 0 or 1.
  6. Normalized losses, the average loss payment per insured vehicle year, fall between 65 and 150 for most cars.
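
Proportions like the ones quoted above can be checked numerically rather than read off the plots. A minimal sketch on a toy column follows; the 8:2 split is made up for illustration, and on the real data the same call would presumably be applied to a column such as `car_Data['aspiration']`.

```python
import pandas as pd

# Toy aspiration column; the 8:2 split is made up for illustration.
aspiration = pd.Series(['std'] * 8 + ['turbo'] * 2, name='aspiration')

# Share of each level, as a fraction of all rows.
shares = aspiration.value_counts(normalize=True)
print(shares)
```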

Bi-Variate Analysis

Numerical VS Numerical Features

In [31]:
sns.pairplot(car_Data.dropna())
Out[31]:
<seaborn.axisgrid.PairGrid at 0x1fd3556fd0>
In [270]:
plt.figure(figsize=(16,16))
# Correlations are only defined for the numeric columns
sns.heatmap(car_Data.select_dtypes(include='number').corr())
plt.show()

Categorical VS Categorical Features

In [43]:
cat_Cols = car_Data.select_dtypes(include=['object']).columns
cat_Cols_Groups = list(itertools.combinations(cat_Cols, 2))
for group in cat_Cols_Groups:
    sns.countplot(x = group[0], hue=group[1], data=car_Data)
    plt.xticks(rotation = 90)
    plt.show()

Categorical VS Numerical

  • Symboling vs Numeric Features
  • Price vs Categorical Features
  • Normalized-losses vs Categorical Features
In [44]:
for column in car_Data.select_dtypes(include=['int64','float64']).columns:
    sns.boxplot(x = 'symboling', y= column, data = car_Data)
    plt.show()
In [46]:
for column in car_Data.select_dtypes(include=['object']).columns:
    sns.boxplot(x = column, y= 'price', data = car_Data)
    plt.show()
In [48]:
for column in car_Data.select_dtypes(include=['object']).columns:
    sns.boxplot(x = column, y= 'normalized-losses', data = car_Data)
    plt.show()

Data Preparation for Model Building

Preprocessing to be done :

  1. Convert categorical variables to numeric variables
  2. Impute missing values
  3. Scaling the features
In [124]:
# Convert categorical variables to numeric variables
cols_to_transform = car_Data.select_dtypes(include=['object']).columns
car_Data = pd.get_dummies(columns=cols_to_transform, data=car_Data, prefix=cols_to_transform, prefix_sep='_', drop_first=True)
In [125]:
cols = car_Data.columns
print(cols)
Index(['normalized-losses', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-ratio',
       'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price',
       'symboling_-1', 'symboling_0', 'symboling_1', 'symboling_2',
       'symboling_3', 'make_audi', 'make_bmw', 'make_chevrolet', 'make_dodge',
       'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda',
       'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan',
       'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault',
       'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen',
       'make_volvo', 'fuel-type_gas', 'aspiration_turbo', 'num-of-doors_two',
       'body-style_hardtop', 'body-style_hatchback', 'body-style_sedan',
       'body-style_wagon', 'drive-wheels_fwd', 'drive-wheels_rwd',
       'engine-location_rear', 'engine-type_l', 'engine-type_ohc',
       'engine-type_ohcf', 'engine-type_ohcv', 'engine-type_rotor',
       'num-of-cylinders_five', 'num-of-cylinders_four',
       'num-of-cylinders_six', 'num-of-cylinders_three',
       'num-of-cylinders_twelve', 'num-of-cylinders_two', 'fuel-system_2bbl',
       'fuel-system_4bbl', 'fuel-system_idi', 'fuel-system_mfi',
       'fuel-system_mpfi', 'fuel-system_spdi', 'fuel-system_spfi'],
      dtype='object')
In [126]:
car_Data.head()
Out[126]:
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price symboling_-1 symboling_0 symboling_1 symboling_2 symboling_3 make_audi make_bmw make_chevrolet make_dodge make_honda make_isuzu make_jaguar make_mazda make_mercedes-benz make_mercury make_mitsubishi make_nissan make_peugot make_plymouth make_porsche make_renault make_saab make_subaru make_toyota make_volkswagen make_volvo fuel-type_gas aspiration_turbo num-of-doors_two body-style_hardtop body-style_hatchback body-style_sedan body-style_wagon drive-wheels_fwd drive-wheels_rwd engine-location_rear engine-type_l engine-type_ohc engine-type_ohcf engine-type_ohcv engine-type_rotor num-of-cylinders_five num-of-cylinders_four num-of-cylinders_six num-of-cylinders_three num-of-cylinders_twelve num-of-cylinders_two fuel-system_2bbl fuel-system_4bbl fuel-system_idi fuel-system_mfi fuel-system_mpfi fuel-system_spdi fuel-system_spfi
0 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
1 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
2 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0
3 164.0 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
4 164.0 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
In [127]:
car_Data_Imputed = pd.DataFrame(KNN(3).fit_transform(car_Data))
Imputing row 1/201 with 1 missing, elapsed time: 0.047
Imputing row 101/201 with 0 missing, elapsed time: 0.049
Imputing row 201/201 with 0 missing, elapsed time: 0.051
In [128]:
car_Data_Imputed.columns = cols
In [129]:
# Checking for nulls
car_Data_Imputed.isnull().sum()
Out[129]:
normalized-losses          0
wheel-base                 0
length                     0
width                      0
height                     0
curb-weight                0
engine-size                0
bore                       0
stroke                     0
compression-ratio          0
horsepower                 0
peak-rpm                   0
city-mpg                   0
highway-mpg                0
price                      0
symboling_-1               0
symboling_0                0
symboling_1                0
symboling_2                0
symboling_3                0
make_audi                  0
make_bmw                   0
make_chevrolet             0
make_dodge                 0
make_honda                 0
make_isuzu                 0
make_jaguar                0
make_mazda                 0
make_mercedes-benz         0
make_mercury               0
make_mitsubishi            0
make_nissan                0
make_peugot                0
make_plymouth              0
make_porsche               0
make_renault               0
make_saab                  0
make_subaru                0
make_toyota                0
make_volkswagen            0
make_volvo                 0
fuel-type_gas              0
aspiration_turbo           0
num-of-doors_two           0
body-style_hardtop         0
body-style_hatchback       0
body-style_sedan           0
body-style_wagon           0
drive-wheels_fwd           0
drive-wheels_rwd           0
engine-location_rear       0
engine-type_l              0
engine-type_ohc            0
engine-type_ohcf           0
engine-type_ohcv           0
engine-type_rotor          0
num-of-cylinders_five      0
num-of-cylinders_four      0
num-of-cylinders_six       0
num-of-cylinders_three     0
num-of-cylinders_twelve    0
num-of-cylinders_two       0
fuel-system_2bbl           0
fuel-system_4bbl           0
fuel-system_idi            0
fuel-system_mfi            0
fuel-system_mpfi           0
fuel-system_spdi           0
fuel-system_spfi           0
dtype: int64
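
To illustrate what nearest-neighbour imputation does, the sketch below uses scikit-learn's `KNNImputer` on a toy frame. This is a comparable but not identical technique to fancyimpute's `KNN` used above, swapped in only for illustration; the frame and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with one missing value; values are illustrative only.
toy_df = pd.DataFrame({
    'engine-size': [130.0, 130.0, 152.0, 109.0, 136.0],
    'horsepower':  [111.0, 111.0, 154.0, np.nan, 115.0],
})

# Each missing entry is filled with the mean of that feature over the
# 3 nearest rows, where distance uses the features that are present.
imputer = KNNImputer(n_neighbors=3)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy_df), columns=toy_df.columns)
print(toy_imputed)
```

Passing `columns=` when rebuilding the frame keeps the original column names, so no separate renaming step is needed.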
In [248]:
cols_to_transform
Out[248]:
Index(['symboling', 'make', 'fuel-type', 'aspiration', 'num-of-doors',
       'body-style', 'drive-wheels', 'engine-location', 'engine-type',
       'num-of-cylinders', 'fuel-system'],
      dtype='object')
In [249]:
# Scaling the features
# Dummy (0/1) columns that are left unscaled
cat_Cols = ['symboling_-1', 'symboling_0', 'symboling_1', 'symboling_2', 'symboling_3', 'make_bmw', 'make_chevrolet', 'make_dodge', 'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda', 'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan', 'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault', 'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel-type_gas','aspiration_turbo', 'num-of-doors_two', 'body-style_hardtop', 'body-style_hatchback', 'body-style_sedan', 'body-style_wagon', 'drive-wheels_fwd', 'drive-wheels_rwd', 'engine-location_rear', 'engine-type_l', 'engine-type_ohc', 'engine-type_ohcf', 'engine-type_ohcv', 'engine-type_rotor', 'num-of-cylinders_five', 'num-of-cylinders_four', 'num-of-cylinders_six', 'num-of-cylinders_three', 'num-of-cylinders_twelve', 'num-of-cylinders_two', 'fuel-system_2bbl', 'fuel-system_4bbl', 'fuel-system_idi', 'fuel-system_mfi', 'fuel-system_mpfi', 'fuel-system_spdi', 'fuel-system_spfi']
# Every remaining column is treated as numeric and passed to the scaler
num_Cols = [x for x in car_Data_Imputed.columns if x not in cat_Cols]
scaler = StandardScaler()
price = car_Data_Imputed.price
car_Num_Data = car_Data_Imputed[num_Cols]
car_Cat_Data = car_Data_Imputed[cat_Cols]
car_Data_Scaled = pd.DataFrame(scaler.fit_transform(car_Num_Data))
car_Data_Scaled.columns = num_Cols
In [251]:
car_Data_Scaled = pd.concat([car_Data_Scaled, car_Cat_Data], axis=1)

# Overwrite the scaled 'price' column with the original, unscaled target
car_Data_Scaled['price'] = price
In [252]:
car_Data_Scaled.head()
Out[252]:
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price make_audi symboling_-1 symboling_0 symboling_1 symboling_2 symboling_3 make_bmw make_chevrolet make_dodge make_honda make_isuzu make_jaguar make_mazda make_mercedes-benz make_mercury make_mitsubishi make_nissan make_peugot make_plymouth make_porsche make_renault make_saab make_subaru make_toyota make_volkswagen make_volvo fuel-type_gas aspiration_turbo num-of-doors_two body-style_hardtop body-style_hatchback body-style_sedan body-style_wagon drive-wheels_fwd drive-wheels_rwd engine-location_rear engine-type_l engine-type_ohc engine-type_ohcf engine-type_ohcv engine-type_rotor num-of-cylinders_five num-of-cylinders_four num-of-cylinders_six num-of-cylinders_three num-of-cylinders_twelve num-of-cylinders_two fuel-system_2bbl fuel-system_4bbl fuel-system_idi fuel-system_mfi fuel-system_mpfi fuel-system_spdi fuel-system_spfi
0 0.084080 -1.685107 -0.439409 -0.853460 -2.034081 -0.014858 0.075389 0.526109 -1.830622 -0.291435 0.203141 -0.244878 -0.652249 -0.542288 13495.0 -0.175412 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.960973 -1.685107 -0.439409 -0.853460 -2.034081 -0.014858 0.075389 0.526109 -1.830622 -0.291435 0.203141 -0.244878 -0.652249 -0.542288 16500.0 -0.175412 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 1.251693 -0.710103 -0.244152 -0.185597 -0.559713 0.518080 0.606234 -2.422443 0.667260 -0.291435 1.356765 -0.244878 -0.964397 -0.689386 16500.0 -0.175412 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 1.226251 0.165748 0.195176 0.148335 0.218425 -0.423766 -0.431327 -0.518948 0.445929 -0.041121 -0.038315 0.801949 -0.184027 -0.100993 13950.0 5.700877 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.226251 0.099646 0.195176 0.243744 0.218425 0.520017 0.220165 -0.518948 0.445929 -0.541748 0.310455 0.801949 -1.120471 -1.277779 17450.0 5.700877 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
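
In the output above, make_audi takes values such as -0.175412 and 5.700877, which suggests it was scaled along with the continuous features because it is absent from the hand-written dummy list. One possible way to avoid maintaining that list by hand is to detect indicator columns from their values. The sketch below does this on a toy frame; `is_binary` is a hypothetical helper and the data are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: two continuous features, one dummy column and the target.
toy_df = pd.DataFrame({
    'engine-size': [130.0, 152.0, 109.0, 136.0],
    'curb-weight': [2548.0, 2823.0, 2337.0, 2824.0],
    'make_audi':   [0.0, 0.0, 1.0, 1.0],
    'price':       [13495.0, 16500.0, 13950.0, 17450.0],
})

def is_binary(series):
    """Hypothetical helper: True if the column only contains 0s and 1s."""
    return set(series.unique()) <= {0.0, 1.0}

feature_cols = [c for c in toy_df.columns if c != 'price']
dummy_cols = [c for c in feature_cols if is_binary(toy_df[c])]
numeric_cols = [c for c in feature_cols if c not in dummy_cols]

# Scale only the continuous features; dummies and the target stay as they are.
toy_scaled = toy_df.copy()
toy_scaled[numeric_cols] = StandardScaler().fit_transform(toy_df[numeric_cols])
print(toy_scaled)
```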

Split the data into Train and Test

In [276]:
X, y = car_Data_Scaled.loc[:, car_Data_Scaled.columns!='price'].values, car_Data_Scaled.loc[:,'price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=15)
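
With test_size=0.30, scikit-learn rounds the test share up to a whole number of rows. For the 201 rows left after imputation this should give 61 test rows and 140 training rows, which agrees with samples = 140 at the root of the decision tree shown later. A small sketch on placeholder arrays (`X_demo` and `y_demo` are stand-ins, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the imputed dataset (201).
X_demo = np.arange(201 * 2).reshape(201, 2)
y_demo = np.arange(201)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=15
)
print(X_tr.shape, X_te.shape)
```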

Starting with a simple model !!!

Linear Regression

In [254]:
linReg = LinearRegression()
# Fit an ordinary least-squares linear regression on the training data
# and predict on both the train and test sets
linear_Model = linReg.fit(X_train, y_train)
y_train_pred = linear_Model.predict(X_train)
y_test_pred = linear_Model.predict(X_test)

Error Metrics for Regression

  • Mean Absolute Error (MAE):

$$MAE = \dfrac{1}{n}\sum_{i = 1}^{n}\left|y_{i} - \hat{y}_{i}\right|$$

  • Mean Squared Error (MSE):

$$MSE = \dfrac{1}{n}\sum_{i = 1}^{n}\left(y_{i} - \hat{y}_{i}\right)^2$$

  • Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\dfrac{1}{n}\sum_{i = 1}^{n}\left(y_{i} - \hat{y}_{i}\right)^2}$$

  • Mean Absolute Percentage Error (MAPE):

$$MAPE = \dfrac{100}{n}\sum_{i = 1}^{n}\left|\dfrac{y_{i} - \hat{y}_{i}}{y_{i}}\right|$$
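
The four metrics can be sketched in NumPy as follows, with the absolute value or square taken per observation before averaging. The function name `regression_metrics` and the toy prices are assumptions made for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mape = 100 * np.mean(np.abs(errors / y_true))  # undefined if any y_true is 0
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'MAPE': mape}

# Toy prices, illustrative only.
metrics = regression_metrics([10000, 20000], [11000, 18000])
print(metrics)
```

MAE and RMSE are in the units of the target (here, price), whereas MAPE is a percentage and so is easier to compare across targets of different scale.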

In [256]:
print('The Mean absolute error on train data: {} \n'.format(mean_absolute_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute error on test data: {} \n'.format(mean_absolute_error(y_pred = y_test_pred, y_true = y_test)))

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print('The Mean absolute percentage error on train data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute percentage error on test data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_test_pred, y_true = y_test)))

print('The R2 Score on train data: {} \n'.format(r2_score(y_pred = y_train_pred, y_true = y_train)))
print('The R2 Score on test data: {} \n'.format(r2_score(y_pred = y_test_pred, y_true = y_test)))
The Mean absolute error on train data: 923.3857142857142 

The Mean absolute error on test data: 1586790752436320.5 

The Mean absolute percentage error on train data: 7.898540171762431 

The Mean absolute percentage error on test data: 10259253947171.098 

The R2 Score on train data: 0.9776519573739968 

The R2 Score on test data: -1.1390190813029911e+24 

Parameters

  • max_depth : int or None, optional (default=None)

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
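
The effect of max_depth can be seen by comparing a capped tree with a fully grown one. This is a minimal sketch on synthetic data (a noisy sine curve, not the car dataset); the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, illustrative only: 200 points of a noisy sine curve.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# Capped tree versus a fully grown tree (max_depth=None).
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
full = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_demo, y_demo)

print(shallow.get_depth(), shallow.get_n_leaves())
print(full.get_depth(), full.get_n_leaves())
```

A tree of depth d has at most 2^d leaves, so the cap directly bounds how finely the training data can be partitioned.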

In [287]:
# set of parameters to test
param_grid = {'max_depth': range(1,11)}
In [288]:
dt = tree.DecisionTreeRegressor()
GS = GridSearchCV(dt, param_grid, cv=10)
GS.fit(X_train, y_train)
Out[288]:
GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': range(1, 11)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
In [289]:
GS.best_params_
Out[289]:
{'max_depth': 10}
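
The selected depth is the largest value in the grid, so it may be worth looking at the whole score curve rather than only the winner. The sketch below shows one way to tabulate `cv_results_`; it runs on synthetic data with a smaller grid, and `search` and `cv_table` are illustrative names. On the real search the same table could presumably be built from `GS.cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, illustrative only.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(120, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=1.0, size=120)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': list(range(1, 6))},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate depth with the mean and spread of its CV R2 score.
cv_table = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table)
```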
In [291]:
importances = GS.best_estimator_.feature_importances_
indices = np.argsort(importances)[::-1]
pd.DataFrame([car_Data_Scaled.columns[indices],np.sort(importances)[::-1]])
Out[291]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
0 engine-size curb-weight normalized-losses length bore symboling_2 wheel-base peak-rpm height width horsepower fuel-type_gas symboling_-1 make_saab city-mpg engine-type_l body-style_wagon fuel-system_4bbl highway-mpg compression-ratio make_dodge body-style_hardtop body-style_sedan stroke aspiration_turbo symboling_1 make_mercury num-of-doors_two symboling_0 make_toyota num-of-cylinders_twelve body-style_hatchback price symboling_3 make_bmw make_chevrolet make_jaguar make_honda make_audi make_isuzu fuel-system_spdi make_mazda engine-type_ohc fuel-system_mfi fuel-system_idi fuel-system_2bbl num-of-cylinders_two num-of-cylinders_three num-of-cylinders_six num-of-cylinders_four num-of-cylinders_five engine-type_rotor engine-type_ohcv engine-type_ohcf engine-location_rear make_mercedes-benz drive-wheels_rwd drive-wheels_fwd make_volvo make_volkswagen make_subaru make_renault make_porsche fuel-system_mpfi make_peugot make_nissan make_mitsubishi make_plymouth
1 0.698564 0.199697 0.0371725 0.0301555 0.00692513 0.00542743 0.00527648 0.00419824 0.0032319 0.00268592 0.00216314 0.00131502 0.000689213 0.000456023 0.000404101 0.000385743 0.000353115 0.000251991 0.000211959 0.000113074 0.000107692 5.40081e-05 4.16701e-05 3.33514e-05 1.6839e-05 1.61359e-05 1.33375e-05 1.21704e-05 1.12725e-05 9.89964e-06 3.19591e-06 2.48422e-06 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
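
The table above lists price among the features even though X was built without it. This suggests the importances were labelled with a column list that still contained the target, which would shift every label positioned after price by one. A hedged sketch of keeping names and importances aligned, on a toy frame with made-up values:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy frame with the target in the middle of the columns, as in car_Data_Scaled.
toy_df = pd.DataFrame({
    'engine-size': [130.0, 152.0, 109.0, 136.0, 97.0, 181.0],
    'price':       [13495.0, 16500.0, 13950.0, 17450.0, 6500.0, 30000.0],
    'make_audi':   [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

# Take the feature names from the same column selection used to build X.
feature_names = toy_df.columns[toy_df.columns != 'price']
X_demo = toy_df[feature_names].values
y_demo = toy_df['price'].values

model = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# A Series indexed by feature name keeps labels and values together.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance)
```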
In [292]:
dot_data = tree.export_graphviz(GS.best_estimator_, out_file=None, feature_names=car_Data_Scaled.drop(labels=['price'],axis=1).columns,class_names='price', filled=True, rounded=True, special_characters=True) 
graph = graphviz.Source(dot_data) 
graph
Out[292]:
[Figure: graphviz rendering of the fitted decision tree. The root node splits on engine-size ≤ 1.33 (mse = 69961236.063, samples = 140, value = 13960.471); its children split on curb-weight ≤ -0.023 (mse = 21683263.621, samples = 125, value = 11540.792) and normalized-losses ≤ 0.178 (mse = 16900564.649, samples = 15, value = 34124.467).]
= 1 value = 24565.0 186->187 188 mse = 0.0 samples = 1 value = 23875.0 186->188 190 width ≤ 0.912 mse = 18200.0 samples = 3 value = 16685.0 189->190 195 peak-rpm ≤ 0.488 mse = 4033129.688 samples = 4 value = 19423.75 189->195 191 mse = 0.0 samples = 1 value = 16515.0 190->191 192 peak-rpm ≤ 0.174 mse = 5625.0 samples = 2 value = 16770.0 190->192 193 mse = 0.0 samples = 1 value = 16695.0 192->193 194 mse = 0.0 samples = 1 value = 16845.0 192->194 196 fuel-system_idi ≤ 0.5 mse = 822838.889 samples = 3 value = 18356.667 195->196 201 mse = 0.0 samples = 1 value = 22625.0 195->201 197 height ≤ 1.119 mse = 2256.25 samples = 2 value = 18997.5 196->197 200 mse = 0.0 samples = 1 value = 17075.0 196->200 198 mse = 0.0 samples = 1 value = 19045.0 197->198 199 mse = 0.0 samples = 1 value = 18950.0 197->199 203 height ≤ 1.078 mse = 6451394.56 samples = 5 value = 29723.2 202->203 212 normalized-losses ≤ 0.291 mse = 7596788.49 samples = 10 value = 36325.1 202->212 204 symboling_0 ≤ 0.5 mse = 2627084.0 samples = 4 value = 30766.0 203->204 211 mse = 0.0 samples = 1 value = 25552.0 203->211 205 compression-ratio ≤ 1.336 mse = 215296.0 samples = 2 value = 32064.0 204->205 208 curb-weight ≤ 1.564 mse = 1669264.0 samples = 2 value = 29468.0 204->208 206 mse = 0.0 samples = 1 value = 32528.0 205->206 207 mse = 0.0 samples = 1 value = 31600.0 205->207 209 mse = 0.0 samples = 1 value = 30760.0 208->209 210 mse = 0.0 samples = 1 value = 28176.0 208->210 213 curb-weight ≤ 2.101 mse = 31506.25 samples = 2 value = 41137.5 212->213 216 normalized-losses ≤ 0.608 mse = 2250861.0 samples = 8 value = 35122.0 212->216 214 mse = 0.0 samples = 1 value = 41315.0 213->214 215 mse = 0.0 samples = 1 value = 40960.0 213->215 217 normalized-losses ≤ 0.351 mse = 1225737.633 samples = 7 value = 35532.286 216->217 230 mse = 0.0 samples = 1 value = 32250.0 216->230 218 mse = 0.0 samples = 1 value = 34028.0 217->218 219 horsepower ≤ 2.027 mse = 990023.667 samples = 6 value = 35783.0 217->219 220 height ≤ 0.362 
mse = 318930.667 samples = 3 value = 34930.0 219->220 225 horsepower ≤ 3.516 mse = 205898.667 samples = 3 value = 36636.0 219->225 221 stroke ≤ 1.189 mse = 61009.0 samples = 2 value = 35303.0 220->221 224 mse = 0.0 samples = 1 value = 34184.0 220->224 222 mse = 0.0 samples = 1 value = 35056.0 221->222 223 mse = 0.0 samples = 1 value = 35550.0 221->223 226 compression-ratio ≤ -0.354 mse = 5476.0 samples = 2 value = 36954.0 225->226 229 mse = 0.0 samples = 1 value = 36000.0 225->229 227 mse = 0.0 samples = 1 value = 36880.0 226->227 228 mse = 0.0 samples = 1 value = 37028.0 226->228
In [294]:
y_train_pred = GS.predict(X_train)
y_test_pred = GS.predict(X_test)
In [295]:
print('The Mean absolute error on train data: {} \n'.format(mean_absolute_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute error on test data: {} \n'.format(mean_absolute_error(y_pred = y_test_pred, y_true = y_test)))

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print('The Mean absolute percentage error on train data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute percentage error on test data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_test_pred, y_true = y_test)))

print('The R2 Score on train data: {} \n'.format(r2_score(y_pred = y_train_pred, y_true = y_train)))
print('The R2 Score on test data: {} \n'.format(r2_score(y_pred = y_test_pred, y_true = y_test)))
The Mean absolute error on train data: 98.25761904761904 

The Mean absolute error on test data: 1540.4240437158471 

The Mean absolute percentage error on train data: 0.7948730018627633 

The Mean absolute percentage error on test data: 13.140648828923112 

The R2 Score on train data: 0.9983218768664412 

The R2 Score on test data: 0.8859327072174835 
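
The same three metrics are printed after every model in this notebook. As a minimal sketch, they could be gathered in one helper; `evaluate` and the toy numbers are hypothetical and are not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score


def mean_absolute_percentage_error(y_true, y_pred):
    # Same definition as in the notebook cells: mean of |error / truth|, in percent.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def evaluate(y_true, y_pred):
    """Return MAE, MAPE (percent) and R2 for one set of predictions."""
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mape": mean_absolute_percentage_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }


# Toy values chosen only to exercise the helper (not taken from the car data).
toy_true = [100.0, 200.0, 400.0]
toy_pred = [110.0, 190.0, 400.0]
toy_scores = evaluate(toy_true, toy_pred)

for name, score in toy_scores.items():
    print("{}: {:.4f}".format(name, score))
```

Calling `evaluate(y_train, y_train_pred)` and `evaluate(y_test, y_test_pred)` side by side makes a train/test gap such as the one above easier to spot.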

From my experience as a data scientist, tree-based ensemble models such as Random Forest and XGBoost perform better in most situations. Regularization keeps these models simple, and GPU support keeps them fast in production.

Parameters

  • n_estimators : integer, optional (default=10).

    The number of trees in the forest.

  • max_depth : integer or None, optional (default=None)

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • max_features : int, float, string or None, optional (default="auto")

    The number of features to consider when looking for the best split.

  • min_samples_leaf : int, float, optional (default=1)

    The minimum number of samples required to be at a leaf node.

In [296]:
# set of parameters to test
param_grid = {
    "n_estimators": [250, 300],
    "max_depth": [1, 5, 10],
    "max_features": [3, 5],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10],
}
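
Before launching the search it can help to estimate its cost. The sketch below counts the candidate settings in this grid with scikit-learn's `ParameterGrid` and multiplies by the number of folds; the variable names are illustrative, and `n_folds = 10` assumes the `cv=10` setting used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [250, 300],
    "max_depth": [1, 5, 10],
    "max_features": [3, 5],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
n_folds = 10
n_cv_fits = n_candidates * n_folds

print("candidates:", n_candidates)
print("cross-validation fits:", n_cv_fits)
```

With `refit=True` (the default), one further fit on the full training set follows the search.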
In [297]:
rf = RandomForestRegressor()
GS = GridSearchCV(rf, param_grid, cv=10)
GS.fit(X_train, y_train)
Out[297]:
GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [250, 300], 'max_depth': [1, 5, 10], 'max_features': [3, 5], 'min_samples_leaf': [1, 2, 4, 6, 8, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
In [298]:
GS.best_params_
Out[298]:
{'max_depth': 10,
 'max_features': 5,
 'min_samples_leaf': 1,
 'n_estimators': 300}
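
Besides `best_params_`, the fitted search object also exposes the cross-validated score of the winning setting. The following is a small sketch on synthetic data, not the car data: the grid, the sample size and the choice of `scoring="neg_mean_absolute_error"` are assumptions made to keep the example fast and to report the score in the same unit as the MAE used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the real training set.
X_demo, y_demo = make_regression(n_samples=80, n_features=6, noise=5.0, random_state=0)

# Deliberately tiny grid so the search finishes quickly.
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the negated MAE
)
search.fit(X_demo, y_demo)

# Flip the sign back to obtain the mean cross-validated MAE of the best candidate.
best_cv_mae = -search.best_score_

print("best parameters:", search.best_params_)
print("cross-validated MAE of best candidate:", best_cv_mae)
```

Comparing this cross-validated error with the train and test errors gives a further check on how well the selected setting generalizes.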
In [299]:
y_train_pred = GS.predict(X_train)
y_test_pred = GS.predict(X_test)
In [300]:
print('The Mean absolute error on train data: {} \n'.format(mean_absolute_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute error on test data: {} \n'.format(mean_absolute_error(y_pred = y_test_pred, y_true = y_test)))

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print('The Mean absolute percentage error on train data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute percentage error on test data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_test_pred, y_true = y_test)))

print('The R2 Score on train data: {} \n'.format(r2_score(y_pred = y_train_pred, y_true = y_train)))
print('The R2 Score on test data: {} \n'.format(r2_score(y_pred = y_test_pred, y_true = y_test)))
The Mean absolute error on train data: 649.4690236913733 

The Mean absolute error on test data: 1393.9668683756806 

The Mean absolute percentage error on train data: 4.725909207955167 

The Mean absolute percentage error on test data: 11.87752985195947 

The R2 Score on train data: 0.9880114955443672 

The R2 Score on test data: 0.8961163764112643 

XGBoost Model

In [257]:
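def runXGB(train_X, train_y, test_X, test_y=None):
    # Train an XGBoost regressor on the training data and return
    # (predictions for test_X, fitted booster). test_y is not used.
    params = {}
    # "reg:linear" is squared-error regression; newer XGBoost releases name it "reg:squarederror".
    params["objective"] = "reg:linear"
    params["eta"] = 0.002  # small learning rate, paired with many boosting rounds
    params["min_child_weight"] = 1
    params["subsample"] = 0.9  # fraction of rows sampled per tree
    params["colsample_bytree"] = 0.8  # fraction of columns sampled per tree
    params["silent"] = 1  # suppress training logs (newer releases use "verbosity")
    params["max_depth"] = 8
    params["seed"] = 1
    plst = list(params.items())
    num_rounds = 2500

    # All columns except the target (price) are used as features.
    xgtrain = xgb.DMatrix(train_X, label=train_y, feature_names=cols[cols!='price'])
    xgtest = xgb.DMatrix(test_X, feature_names=cols[cols!='price'])
    model = xgb.train(plst, xgtrain, num_rounds)
    pred_test_y = model.predict(xgtest)
    return pred_test_y, model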
In [258]:
y_train_pred, model = runXGB(np.array(X_train), y_train, np.array(X_train))
y_test_pred, model = runXGB(np.array(X_train), y_train, np.array(X_test))
In [259]:
print('The Mean absolute error on train data: {} \n'.format(mean_absolute_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute error on test data: {} \n'.format(mean_absolute_error(y_pred = y_test_pred, y_true = y_test)))

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print('The Mean absolute percentage error on train data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute percentage error on test data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_test_pred, y_true = y_test)))

print('The R2 Score on train data: {} \n'.format(r2_score(y_pred = y_train_pred, y_true = y_train)))
print('The R2 Score on test data: {} \n'.format(r2_score(y_pred = y_test_pred, y_true = y_test)))
The Mean absolute error on train data: 267.87432686941963 

The Mean absolute error on test data: 1261.504786757172 

The Mean absolute percentage error on train data: 1.7895147927975796 

The Mean absolute percentage error on test data: 10.754759611718828 

The R2 Score on train data: 0.99728537315 

The R2 Score on test data: 0.9197314390292222 

In [260]:
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
In [261]:
xgb.to_graphviz(model,)
Out[261]:
[Figure: first boosted tree rendered by xgb.to_graphviz. The root splits on engine-size < 1.330, followed by splits on curb-weight < -0.023 and curb-weight < -0.660.]

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods [1-7] and representing the only possible consistent and locally accurate additive feature attribution method based on expectations (see the SHAP NIPS paper for details). Link : https://github.com/slundberg/shap

In [204]:
# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(pd.DataFrame(X_train, columns=cols[cols!='price']))

# visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0,:], pd.DataFrame(X_train, columns=cols[cols!='price']).iloc[0,:])
Out[204]:
[Figure: SHAP force plot for the first training sample. The interactive visualization is not rendered in this export.]

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue.
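
The additive property described here can be checked by hand in a simpler setting. The sketch below uses a linear model rather than the XGBoost model and `TreeExplainer` from this notebook: for a linear model with features treated as independent, the SHAP value of feature i has the closed form w_i * (x_i - mean(x_i)). The weights, bias and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical background data
weights = np.array([2.0, -1.0, 0.5])   # hypothetical model coefficients
bias = 10.0


def predict(data):
    # Plain linear model standing in for the trained regressor.
    return data @ weights + bias


# Base value: the average model output over the background data.
base_value = predict(X).mean()

# Closed-form SHAP values for one sample of a linear model.
x = X[0]
phi = weights * (x - X.mean(axis=0))

# Base value plus all feature contributions recovers the model output.
reconstructed = base_value + phi.sum()
actual = predict(x[None, :])[0]

print("base value:", base_value)
print("contributions:", phi)
print("reconstructed vs. actual:", reconstructed, actual)
```

Positive entries of `phi` correspond to the red segments of a force plot and negative entries to the blue ones.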

In [208]:
shap.summary_plot(shap_values, pd.DataFrame(X_train, columns=cols[cols!='price']))

To get an overview of which features are most important for a model, we can plot the SHAP values of every feature for every sample. The plot above sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impact each feature has on the model output. The color represents the feature value (red high, blue low). This reveals, for example, that a high curb-weight increases the predicted car price.
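
The ordering used by the summary plot can be reproduced directly from the SHAP matrix. A minimal sketch, assuming a `shap_values` array of shape (samples, features); the numbers and the three-feature subset below are made up for illustration and are not the values computed in this notebook.

```python
import numpy as np

feature_names = ["curb-weight", "engine-size", "height"]  # hypothetical subset

# Made-up SHAP values: one row per sample, one column per feature.
shap_values = np.array([
    [1500.0, -300.0, 20.0],
    [-900.0, 800.0, -10.0],
    [400.0, -1200.0, 5.0],
])

# Importance of a feature = sum of absolute SHAP values over all samples.
importance = np.abs(shap_values).sum(axis=0)

# Sort from most to least important, as the summary plot does.
order = np.argsort(importance)[::-1]
ranked = [feature_names[i] for i in order]

for name, score in zip(ranked, importance[order]):
    print("{}: {}".format(name, score))
```

Because absolute values are summed, a feature that pushes some predictions up and others down still ranks high.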

In [193]:
sns.scatterplot(x=car_Data['engine-size'], y=car_Data['curb-weight'], hue=car_Data['price'])
Out[193]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e6106d8>

From the plots above, we can infer that curb-weight and engine-size alone appear sufficient to predict the price of a car with relatively low error.

We can therefore drop the insurance-related features, symboling and normalized-losses, when predicting the price of a car. In turn, price, symboling and normalized-losses can be used together when deciding whether to buy a car from a seller.

Building a regression model to predict the normalized-losses

Split the data into Train and Test

In [262]:
excluded_cols = ['normalized-losses', 'price', 'symboling_-1', 'symboling_0', 'symboling_1', 'symboling_2', 'symboling_3']
feature_cols = [x for x in car_Data_Scaled.columns if x not in excluded_cols]
X, y = car_Data_Scaled.loc[:, feature_cols].values, car_Data_Scaled.loc[:, 'normalized-losses'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=15)
In [263]:
def runXGB(train_X, train_y, test_X, test_y=None):
    # Train an XGBoost regressor for normalized-losses and return
    # (predictions for test_X, fitted booster). test_y is not used.
    # Features exclude the target, price and the one-hot symboling columns.
    excluded = ['normalized-losses','price','symboling_-1','symboling_0', 'symboling_1', 'symboling_2','symboling_3']
    feature_names = [x for x in car_Data_Scaled.columns if x not in excluded]

    params = {}
    # "reg:linear" is squared-error regression; newer XGBoost releases name it "reg:squarederror".
    params["objective"] = "reg:linear"
    params["eta"] = 0.002
    params["min_child_weight"] = 1
    params["subsample"] = 0.9
    params["colsample_bytree"] = 0.8
    params["silent"] = 1  # suppress training logs (newer releases use "verbosity")
    params["max_depth"] = 8
    params["seed"] = 1
    plst = list(params.items())
    num_rounds = 2500

    xgtrain = xgb.DMatrix(train_X, label=train_y, feature_names=feature_names)
    xgtest = xgb.DMatrix(test_X, feature_names=feature_names)
    model = xgb.train(plst, xgtrain, num_rounds)
    pred_test_y = model.predict(xgtest)
    return pred_test_y, model
In [264]:
y_train_pred, model = runXGB(np.array(X_train), y_train, np.array(X_train))
y_test_pred, model = runXGB(np.array(X_train), y_train, np.array(X_test))
In [265]:
print('The Mean absolute error on train data: {} \n'.format(mean_absolute_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute error on test data: {} \n'.format(mean_absolute_error(y_pred = y_test_pred, y_true = y_test)))

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print('The Mean absolute percentage error on train data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_train_pred, y_true = y_train)))
print('The Mean absolute percentage error on test data: {} \n'.format(mean_absolute_percentage_error(y_pred = y_test_pred, y_true = y_test)))

print('The R2 Score on train data: {} \n'.format(r2_score(y_pred = y_train_pred, y_true = y_train)))
print('The R2 Score on test data: {} \n'.format(r2_score(y_pred = y_test_pred, y_true = y_test)))
The Mean absolute error on train data: 0.04406405260419969 

The Mean absolute error on test data: 0.31092898804246344 

The Mean absolute percentage error on train data: 17.349156846202764 

The Mean absolute percentage error on test data: 60.35000328133911 

The R2 Score on train data: 0.9935189283042111 

The R2 Score on test data: 0.8023596164514923 
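
The percentage errors above are much larger than the MAE and R2 would suggest. One likely reason, assuming the normalized-losses target was standardized along with the features (the name `car_Data_Scaled` and the magnitude of the MAE point that way), is that a standardized target has values close to zero, and MAPE divides by the true value. The toy numbers below are hypothetical and only illustrate the effect.

```python
import numpy as np


def mape(y_true, y_pred):
    # Mean absolute percentage error, as defined in the notebook cells.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# Hypothetical target on its original scale, with a constant error of 5 units.
raw_true = np.array([100.0, 121.0, 145.0])
raw_pred = raw_true + 5.0

# Standardize truth and predictions with the statistics of the true values.
mean, std = raw_true.mean(), raw_true.std()
scaled_true = (raw_true - mean) / std
scaled_pred = (raw_pred - mean) / std

raw_mape = mape(raw_true, raw_pred)
scaled_mape = mape(scaled_true, scaled_pred)

print("MAPE on the original scale: {:.2f}%".format(raw_mape))
print("MAPE on the standardized scale: {:.2f}%".format(scaled_mape))
```

The predictions are equally good in both cases; only the denominator changes. For a standardized target, MAE and R2 are the more reliable of the three metrics, or the predictions can be transformed back to the original scale before computing MAPE.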

In [266]:
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
In [267]:
xgb.to_graphviz(model)
Out[267]:
[Figure: first boosted tree rendered by xgb.to_graphviz. The root splits on height < -0.621, followed by splits on make_volkswagen and make_peugot; deeper nodes split mainly on curb-weight, length, peak-rpm and make indicators.]
In [268]:
# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(pd.DataFrame(X_train, columns=[x for x in car_Data_Scaled.columns if x not in ['normalized-losses','price','symboling_-1','symboling_0', 'symboling_1', 'symboling_2','symboling_3']]))

# visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0,:], pd.DataFrame(X_train, columns=[x for x in car_Data_Scaled.columns if x not in ['normalized-losses','price','symboling_-1','symboling_0', 'symboling_1', 'symboling_2','symboling_3']]).iloc[0,:])
Out[268]:
[Figure: SHAP force plot for the first training sample. The interactive visualization is not rendered in this export.]
In [269]:
shap.summary_plot(shap_values, pd.DataFrame(X_train, columns=[x for x in car_Data_Scaled.columns if x not in ['normalized-losses','price','symboling_-1','symboling_0', 'symboling_1', 'symboling_2','symboling_3']]))

The summary plot suggests that a greater car height increases the predicted normalized-losses.